<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<link rel="stylesheet" href="pluginAssets/highlight.js/atom-one-light.css">
<title>Reinforcement learning</title>
<link rel="stylesheet" href="pluginAssets/katex/katex.css" /><link rel="stylesheet" href="./style.css" /></head>
<body>

<div id="rendered-md"><h1 id="reinforcement-learning">Reinforcement learning</h1>
<nav class="table-of-contents"><ul><li><a href="#reinforcement-learning">Reinforcement learning</a><ul><li><a href="#what-is-reinforcement-learning">What is reinforcement learning?</a></li><li><a href="#approaches">Approaches</a><ul><li><a href="#random-search">Random search</a></li><li><a href="#policy-gradient">Policy gradient</a></li><li><a href="#q-learning">Q-learning</a></li></ul></li><li><a href="#alpha-stuff">Alpha-stuff</a><ul><li><a href="#alphago">AlphaGo</a></li><li><a href="#alphazero">AlphaZero</a></li><li><a href="#alphastar">AlphaStar</a></li></ul></li></ul></li></ul></nav><h2 id="what-is-reinforcement-learning">What is reinforcement learning?</h2>
<p>The agent is in a state and takes an action.<br>
The action is selected by a policy - a function from states to actions.<br>
The environment tells the agent its new state, and provides a reward (a number; higher is better).<br>
The learner adapts the policy to maximise the expectation of future rewards.</p>
<p>Markov decision process: the optimal policy need not depend on previous states; only the information in the current state counts.</p>
<p><img src="_resources/e78427ef0d0845d0ae21e1c7857d2740.png" alt="90955f3da8fb0d61c2fa9f3033c65098.png"></p>
<p>Dealing with a sparse loss:</p>
<ul>
<li>start with imitation learning - supervised learning, copying human actions</li>
<li>reward shaping - guessing a reward for intermediate states, or for states close to good states</li>
<li>auxiliary goals - curiosity, maximum distance traveled</li>
</ul>
<p>Policy network: an NN whose input is the state and whose output is an action, with a softmax output layer to produce a probability distribution over actions.</p>
<p>Three problems of RL:</p>
<ul>
<li>non-differentiable loss</li>
<li>balancing exploration and exploitation
<ul>
<li>this is a classic trade-off in online learning</li>
<li>for example, an agent in a maze may learn to reach a reward of 1 that's close by and keep exploiting it, and so never explore further and reach the reward of 100 at the end of the maze</li>
</ul>
</li>
<li>delayed reward/sparse loss
<ul>
<li>you might take an action that causes a negative result, but the result won't show up until some time later</li>
<li>for example, if you start studying before an exam, that's a good thing.<br>
the issue is that you started one day before, and didn't do jack shit during the preceding two weeks - the bad grade only shows up at the exam, long after the slacking that caused it.</li>
<li>credit assignment problem: how do you know which action takes the credit for the bad result?</li>
</ul>
</li>
</ul>
<p>Deterministic policy - every state is followed by the same action.<br>
Probabilistic policy - all actions are possible, but certain actions have a higher probability.</p>
<h2 id="approaches">Approaches</h2>
<p>How do you choose the weights (how do you learn)?<br>
Simple backpropagation doesn't work - we don't have labeled examples telling us which move to take for a given state.</p>
<h3 id="random-search">Random search</h3>
<p>Pick a random point m in model space.</p>
<pre class="hljs"><code>loop:
    pick random point m' close to m
    if loss(m') &lt; loss(m):
        m &lt;- m'
</code></pre>
<p>"Close to" means sampled uniformly among all points at some pre-chosen distance r from m.</p>
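<p>A minimal runnable sketch of the pseudocode above, in Python; the quadratic <code>loss_fn</code>, the starting point and the radius <code>r</code> are made-up placeholders, not something from the slides:</p>
<pre class="hljs"><code>import numpy as np

def random_search(loss_fn, m, r=0.1, steps=1000, seed=0):
    """Hill-climbing: keep trying random neighbours at distance r from m."""
    rng = np.random.default_rng(seed)
    best_loss = loss_fn(m)
    for _ in range(steps):
        # sample m' uniformly on the sphere of radius r around m
        direction = rng.normal(size=m.shape)
        m_new = m + r * direction / np.linalg.norm(direction)
        new_loss = loss_fn(m_new)
        if new_loss &lt; best_loss:   # accept m' only if it improves the loss
            m, best_loss = m_new, new_loss
    return m

# usage: a made-up quadratic loss over a 2-parameter "model"
m_final = random_search(lambda m: np.sum((m - 3.0) ** 2), m=np.zeros(2))
print(m_final)   # ends up close to (3, 3)
</code></pre>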
<h3 id="policy-gradient">Policy gradient</h3>
<p>Follow some semi-random policy, wait until a reward state is reached, then label all previous state-action pairs with the final outcome.<br>
i.e. if some actions were bad, they will on average occur more often in sequences ending with a negative reward, and so will on average be labeled as bad more often.</p>
<p><img src="_resources/c484829362004f90be2b33a92acf7fd9.png" alt="442f7f9bc5e14ffbbcfd54f6ea6b72df.png"></p>
<p>\(\nabla\, \mathbb{E}_a\, r(a) = \nabla \sum_{a} p(a)\, r(a) = \mathbb{E}_{a}\, r(a)\, \nabla \ln p(a)\), where r is the ultimate reward at the end of the trajectory.</p>
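<p>A minimal sketch of this score-function estimator in Python, on a bandit-style toy problem; the state-independent softmax policy, the reward table and the learning rate are made-up placeholders, not from the slides:</p>
<pre class="hljs"><code>import numpy as np

rng = np.random.default_rng(0)
n_actions, lr = 4, 0.1
theta = np.zeros(n_actions)            # logits of a state-independent softmax policy

def reward(a):                         # made-up reward table r(a)
    return [0.0, 1.0, 0.0, 3.0][a]

for episode in range(2000):
    p = np.exp(theta - theta.max()); p /= p.sum()   # softmax gives p(a)
    a = rng.choice(n_actions, p=p)                  # sample an action from the policy
    r = reward(a)                                   # ultimate reward of this (one-step) trajectory
    grad_log_p = -p; grad_log_p[a] += 1.0           # gradient of ln p(a) w.r.t. theta for a softmax
    theta += lr * r * grad_log_p                    # ascend E_a[ r(a) * grad ln p(a) ]

p = np.exp(theta - theta.max()); p /= p.sum()
print(p.round(3))   # probability mass has moved to the highest-reward action
</code></pre>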
<h3 id="q-learning">Q-learning</h3>
<p>If I need this, I'll make better notes; I can't really understand it from the slides.</p>
<h2 id="alpha-stuff">Alpha-stuff</h2>
<h3 id="alphago">AlphaGo</h3>
<p>Starts with imitation learning.<br>
It improves by playing against previous iterations and against itself, trained by reinforcement learning using policy gradients to update the weights.<br>
During play, it uses Monte Carlo Tree Search, with node values being the probability that black will win from that state.</p>
<h3 id="alphazero">AlphaZero</h3>
<p>Learns from scratch - there's no imitation learning or reward shaping.<br>
Also applicable to other games like chess.</p>
<p>Improves on AlphaGo by:</p>
<ul>
<li>combining the policy and value nets</li>
<li>viewing MCTS as a policy improvement operator</li>
<li>adding residual connections and batch normalization</li>
</ul>
<h3 id="alphastar">AlphaStar</h3>
<p>This shit can play StarCraft.</p>
<p>Real time, imperfect information, a large and diverse action space, and no single best strategy.<br>
Its behaviour is generated by a deep NN that gets its input from the game interface and outputs instructions that form an action in the game.</p>
<p>It has a transformer torso for the units,<br>
a deep LSTM core with an autoregressive policy head, and a pointer network.<br>
It makes use of multi-agent learning.</p>
</div>
</body>
</html>